Ohio State University 2023 Earnings - EDA and Models¶

This project provides an exploratory data analysis (EDA) of The Ohio State University 2023 Combined Earnings dataset. It explores the structure of the data, investigates object and numerical variables, and delves into gender pay gap analysis through various visualizations and statistical techniques. This repository includes not only the EDA but also model implementations to identify outliers and patterns in the earnings data.

Table of Contents¶

  1. Import Necessary Libraries
  2. About the Dataset
  3. Load the Dataset
  4. Overview of the Dataset
  5. Presence of negative values
    • 5.1 Possible explanations
    • 5.2 Compare to data from other university
  6. Analyze Object Data
    • 6.1 Analyze frequencies/percentages of some data
      • 6.1.1 Gender
      • 6.1.2 Position Groups
    • 6.2 Gender and Position Groups
  7. Analyze Numerical Data
    • 7.1 Overview
      • 7.1.1 Visualizations of Distributions
      • 7.1.2 Observations
    • 7.2 Highest Paying Jobs
    • 7.3 Gross Pay Categories Distribution
    • 7.4 Gross Pay by Position Group
      • 7.4.1 Mean and Median Gross Pay by Position Group
      • 7.4.2 Mean and Median Gross Pay by Gender and Position Group
    • 7.5 Gross Pay and Gender
      • 7.5.1 Mean Gross Pay by Gender
      • 7.5.2 Gross Pay Categories by Gender
  8. Outliers
    • 8.1 Boxplot visualization
      • 8.1.1 Boxplots by Position Group
      • 8.1.2 Boxplots by Position Group and Gender
    • 8.2 Skewness and Kurtosis
    • 8.3 Correlation
    • 8.4 Bivariate Outliers: Gross Pay and Regular Pay
      • 8.4.1 Observations
    • 8.5 Using Isolation Forest to Find Outliers
    • 8.6 Conclusion on Outliers
  9. Models
  10. Final Analysis

Import Necessary Libraries¶

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [2]:
# Import Scikit-learn for Machine Learning libraries
from sklearn.linear_model import (LinearRegression, HuberRegressor, RANSACRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, IsolationForest)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error 
In [3]:
# See all column names & rows when you doing .head(). None of the column name will be truncated.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.copy_on_write = True

About the dataset¶

The Ohio State University 2023 Combined Earnings dataset is taken from its website at https://apps.hr.osu.edu/salaries/

It has been changed to suit analysis purposes.

Statistics and Reports
The Office of Human Resources provides analysis and reporting of human resources management information, including demographics, internal and external markets for faculty and staff salaries, and mandated federal and state reporting.

Earnings
Earnings dataset includes all non-student employees’ regular, overtime, bonus and sick leave/vacation payout for the timeframe. For individuals with more than one job title/appointment, their earnings are listed in the area that corresponds to their primary appointment.

About the Data
Amounts represent paid earnings, including corrections to prior payrolls and in some cases may reflect negative values.

  • Regular: components of the employee’s regular pay, including paid leave
  • Overtime: includes call back pay, FLSA premium and holiday pay (worked)
  • Bonus: includes staff award, STEP faculty and STEP staff
  • Other: remaining components of the employee’s pay

Form Field Definitions
Database: Differentiates positions associated with the Wexner Medical Center/Health System, Athletics, and the rest of the university. The Ohio State Athletics Department operates a self-sustaining budget, receiving no University general funds, student fees or state tax support. The Wexner Medical Center/Health System operates a self-sustaining budget, receiving no University general funds, student fees or state tax support.

Resulting Data: Earnings data includes all non-student employees’ regular, overtime, bonus and sick leave/vacation payout for 2023. For individuals with more than one job title/appointment, their earnings are listed in the area that corresponds to their primary appointment.

The original file contained the following columns:

  • Last Name - Legal
  • First Name - Preferred
  • Job Profile Name
  • Cost Center
  • Cost Center Hierarchy CCH6
  • Position Group
  • Regular Pay (Base Pay + Paid Time Off + Premium)
  • Bonus
  • Overtime
  • Other (Allowance + Supplemental + Uncategorized)
  • Gross Pay

The modified file contains the following columns:

  • full_name
  • gender
  • position_group
  • job hierarchy
  • cost_center
  • regular_pay
  • bonus
  • other
  • overtime
  • gross_pay

Load the Dataset¶

In [4]:
df = pd.read_excel("Data/OSU-2023-Earnings-Cleaned.xlsx")

Overview of the dataset to understand its structure and contents.¶

In [5]:
# View the first few rows
df.head(3)
Out[5]:
full_name gender position_group job hierarchy cost_center regular_pay bonus other overtime gross_pay
0 Eric Green Male University Professor - Clinical Veterinary Medicine CCH6 CC10587 Veterinary Medicine | Veterinary Clini... 181225.86 29332.76 3000.0 0.0 213558.62
1 Chris Brew Male University Visiting Associate Professor Engineering CCH6 CC11786 Engineering | Computer Science and Eng... 7429.90 0.00 0.0 0.0 7429.90
2 Robin Judd Male University Associate Professor (9M) Arts and Sciences CCH6 CC12460 Arts and Sciences | History 111042.26 0.00 3000.0 0.0 114042.26
In [6]:
df.describe()
Out[6]:
regular_pay bonus other overtime gross_pay
count 4.188700e+04 4.188700e+04 4.188700e+04 41887.000000 4.188700e+04
mean 6.878618e+04 2.340076e+03 2.783559e+03 1081.562495 7.499138e+04
std 7.481114e+04 1.609100e+04 3.749963e+04 4256.524367 1.002738e+05
min -2.185730e+03 -1.000000e+03 -2.500000e+03 -14.990000 2.000000e-02
25% 2.760819e+04 0.000000e+00 0.000000e+00 0.000000 2.915501e+04
50% 5.504043e+04 0.000000e+00 0.000000e+00 0.000000 5.772600e+04
75% 8.736613e+04 6.000000e+02 1.788000e+01 192.465000 9.145901e+04
max 2.433750e+06 1.385738e+06 6.923463e+06 152487.380000 9.173463e+06
In [7]:
# Overview of data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41887 entries, 0 to 41886
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   full_name       41887 non-null  object 
 1   gender          41887 non-null  object 
 2   position_group  41887 non-null  object 
 3   job             41887 non-null  object 
 4   hierarchy       41887 non-null  object 
 5   cost_center     41887 non-null  object 
 6   regular_pay     41887 non-null  float64
 7   bonus           41887 non-null  float64
 8   other           41887 non-null  float64
 9   overtime        41887 non-null  float64
 10  gross_pay       41887 non-null  float64
dtypes: float64(5), object(6)
memory usage: 3.5+ MB

There are no missing values and each column has the correct data type.

Presence of negative values¶

A commentary to the data states: "Amounts represent paid earnings, including corrections to prior payrolls and in some cases may reflect negative values." Nevertheless, it concerns me that the amount of negative values in some categories account for up to 70 percent of all observations.

Possible explanations¶

If there are no data entry errors, several factors specific to the nature of the data or the institution's practices might explain this

  • Adjustments and Corrections
    • Widespread Overpayments If there were systemic overpayments to a large number of employees, the subsequent adjustments to correct these overpayments could result in a high number of negative values.
    • Year-End Adjustments: Annual adjustments to reconcile accounts might be recorded as negative values if the initial estimates were higher than the final amounts.

  • Deductions and Recoveries:
    • Loan Repayments If a significant portion of the workforce has taken advances or loans that are being repaid through payroll deductions, these deductions might be recorded as negative values.
    • Benefit Contributions: High and widespread contributions to benefit plans (e.g., retirement, healthcare) that are deducted from salaries might appear as negative values in the data.

  • Special Payroll Accounting Practices:
    • Contra Accounts: If the institution uses contra accounts to offset various payroll components, many values might be recorded as negative to reflect these offsets.
    • Reimbursements: Regular reimbursements or refunds, if recorded as negative values, could contribute significantly to the overall dataset.

  • Leave and Absence Policies:
    • Unpaid Leave: If a large number of employees take unpaid leave, the adjustments might result in negative values in payroll records.
    • Clawbacks: Regular clawbacks for benefits or other allowances due to leave or other reasons might be recorded as negative values.

  • Employee Turnover and Final Settlements:
    • High Turnover: If there is high employee turnover, final settlements might include negative values due to recoveries or adjustments for previously issued benefits or pay.

  • Large-Scale Financial Adjustments:
    • Budget Cuts or Financial Reconciliation: If the institution is undergoing budget cuts or financial reconciliation, it might result in a large number of negative adjustments to payroll data.

Compare to data from other university¶

I happen to have similar data from Ohio University to examine and compare with OSU data.

In [8]:
ohio_uni = pd.read_excel("Data/Compensation Data Final 2023.xlsx")
In [9]:
ohio_uni.describe()
Out[9]:
APPOINTMENT_TYPE_MONTHS MEDICARE WAGES ADDL_COMP BASE_SALARY
count 3140.000000 5057.000000 4733.000000 5057.000000
mean 11.007325 58096.326674 10360.860137 49438.688784
std 1.398451 49778.583247 18010.894424 50937.556410
min 9.000000 17.040000 1.610000 0.000000
25% 9.000000 25201.370000 2000.000000 20.170000
50% 12.000000 50531.240000 3905.000000 52020.000000
75% 12.000000 77610.670000 11090.000000 78030.000000
max 12.000000 730549.400000 219476.600000 603000.000000

Observations:

As we can see, no negative values are present in any of the columns in the Ohio University dataset (which may suggest a more straightforward or standardized accounting process, or it could indicate better data quality controls).

Also, Ohio State University seems to have a wider range of earnings, with more pronounced outliers, especially in bonus and other compensations. The Ohio University dataset, while still variable, seems more consistent in terms of salary distribution (though it covers different compensation components).

This fact requires further investigation and analysis.

Analyze object data¶

In [10]:
obcol = df.select_dtypes(include=["object"])
obcol.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41887 entries, 0 to 41886
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   full_name       41887 non-null  object
 1   gender          41887 non-null  object
 2   position_group  41887 non-null  object
 3   job             41887 non-null  object
 4   hierarchy       41887 non-null  object
 5   cost_center     41887 non-null  object
dtypes: object(6)
memory usage: 1.9+ MB

Analyze frequencies/percentages of some data¶

In [11]:
oth_data = df[['gender', 'position_group', 'hierarchy']]
In [12]:
pd.options.display.float_format = '{:.2f}'.format
for col in oth_data:
    print(col, "----------------------", "frequencies", oth_data[col].value_counts(),"percentages", 
          oth_data[col].value_counts(normalize=True),
          sep="\n\n", end="\n\n\n")
gender

----------------------

frequencies

gender
Female    24639
Male      17248
Name: count, dtype: int64

percentages

gender
Female   0.59
Male     0.41
Name: proportion, dtype: float64


position_group

----------------------

frequencies

position_group
University       22102
Health System    17640
Athletics         2145
Name: count, dtype: int64

percentages

position_group
University      0.53
Health System   0.42
Athletics       0.05
Name: proportion, dtype: float64


hierarchy

----------------------

frequencies

hierarchy
Medicine CCH6                                              5600
Health System | University Hospital CCH6                   5345
Health System | James Cancer Hospital CCH6                 4474
Arts and Sciences CCH6                                     2415
Athletics and Business Advancement CCH6                    2145
Food, Agricultural, and Environmental Sciences CCH6        1955
Health System | Shared Services CCH6                       1885
Academic Affairs CCH6                                      1885
Administration and Planning CCH6                           1859
Health System | Ambulatory Services CCH6                   1520
Health System | University Hospital East CCH6              1441
Health Sciences CCH6                                       1197
Engineering CCH6                                           1135
Student Life CCH6                                          1108
Health System | Ross Heart Hospital CCH6                    950
Veterinary Medicine CCH6                                    785
Education and Human Ecology CCH6                            637
Enterprise for Research, Innovation, and Knowledge CCH6     484
University Advancement CCH6                                 456
Dentistry CCH6                                              447
Human Resources CCH6                                        404
Business CCH6                                               389
Health System | OSU Harding Hospital CCH6                   381
Nursing CCH6                                                362
Business and Finance CCH6                                   336
Pharmacy CCH6                                               249
Social Work CCH6                                            230
Newark CCH6                                                 206
Law CCH6                                                    189
Health System | Other Business Activities CCH6              185
Health System | Specialty Care Network CCH6                 156
Optometry CCH6                                              154
Marion CCH6                                                 143
Lima CCH6                                                   141
Public Health CCH6                                          139
Mansfield CCH6                                              130
John Glenn Public Affairs CCH6                              115
OSU Health Plan CCH6                                        106
Marketing and Communications CCH6                            77
Legal Affairs CCH6                                           42
President CCH6                                               14
Government Affairs CCH6                                      12
Board of Trustees CCH6                                        3
OSU Physicians CCH6                                           1
Name: count, dtype: int64

percentages

hierarchy
Medicine CCH6                                             0.13
Health System | University Hospital CCH6                  0.13
Health System | James Cancer Hospital CCH6                0.11
Arts and Sciences CCH6                                    0.06
Athletics and Business Advancement CCH6                   0.05
Food, Agricultural, and Environmental Sciences CCH6       0.05
Health System | Shared Services CCH6                      0.05
Academic Affairs CCH6                                     0.05
Administration and Planning CCH6                          0.04
Health System | Ambulatory Services CCH6                  0.04
Health System | University Hospital East CCH6             0.03
Health Sciences CCH6                                      0.03
Engineering CCH6                                          0.03
Student Life CCH6                                         0.03
Health System | Ross Heart Hospital CCH6                  0.02
Veterinary Medicine CCH6                                  0.02
Education and Human Ecology CCH6                          0.02
Enterprise for Research, Innovation, and Knowledge CCH6   0.01
University Advancement CCH6                               0.01
Dentistry CCH6                                            0.01
Human Resources CCH6                                      0.01
Business CCH6                                             0.01
Health System | OSU Harding Hospital CCH6                 0.01
Nursing CCH6                                              0.01
Business and Finance CCH6                                 0.01
Pharmacy CCH6                                             0.01
Social Work CCH6                                          0.01
Newark CCH6                                               0.00
Law CCH6                                                  0.00
Health System | Other Business Activities CCH6            0.00
Health System | Specialty Care Network CCH6               0.00
Optometry CCH6                                            0.00
Marion CCH6                                               0.00
Lima CCH6                                                 0.00
Public Health CCH6                                        0.00
Mansfield CCH6                                            0.00
John Glenn Public Affairs CCH6                            0.00
OSU Health Plan CCH6                                      0.00
Marketing and Communications CCH6                         0.00
Legal Affairs CCH6                                        0.00
President CCH6                                            0.00
Government Affairs CCH6                                   0.00
Board of Trustees CCH6                                    0.00
OSU Physicians CCH6                                       0.00
Name: proportion, dtype: float64


Gender¶

In [13]:
fig, ax = plt.subplots(1,2,figsize=(15,5), sharey=False, sharex=True, dpi = 200)
sns.despine(fig)

p1= sns.countplot(x ='gender', data = df, color = 'gold', ax=ax[0])
for container in p1.containers:
    ax[0].bar_label(container, fontsize=10)

p2 = sns.countplot(x ='gender', data = df, color = 'gold', stat = 'percent', ax=ax[1] )
for container in p2.containers:
    ax[1].bar_label(container, fontsize=10, fmt= '{:.1f}%')

p1.set(xlabel='', ylabel='')
p2.set(xlabel='', ylabel='')
ax[0].set_title('Count by Gender', pad = 7, fontsize=14)
ax[1].set_title('Percentage by Gender', pad = 7, fontsize=14)
plt.show()
No description has been provided for this image

Observations:

The dataset contains more females (24,639; 58.8%) than males (17,248; 41.2%), indicating a higher representation of females.

Position Groups¶

In [14]:
fig, ax = plt.subplots(1,2,figsize=(15,5), sharey=False, sharex=True, dpi = 200)
sns.despine(fig)

p1= sns.countplot(x ='position_group', data = df, color = 'teal', ax=ax[0])
for container in p1.containers:
    ax[0].bar_label(container, fontsize=10)

p2 = sns.countplot(x ='position_group', data = df, color = 'teal', stat = 'percent', ax=ax[1] )
for container in p2.containers:
    ax[1].bar_label(container, fontsize=10, fmt= '{:.1f}%')

p1.set(xlabel='', ylabel='')
p2.set(xlabel='', ylabel='')
ax[0].set_title('Count by Position Group', pad = 7, fontsize=14)
ax[1].set_title('Percentage by Position Group', pad = 7, fontsize=14)
plt.show()
No description has been provided for this image

Observations:

Most people work for University, least number of people, more than ten times fewer, work for Athletics position group.

Gender and Position Groups¶

In [15]:
fig, ax = plt.subplots(1,2,figsize=(15,5), sharey=False, sharex=True, dpi = 200)
sns.despine(fig)

p1= sns.countplot(x ='position_group', hue = "gender", data = df, palette = "rocket", ax=ax[0])
for container in p1.containers:
    ax[0].bar_label(container, fontsize=10)

p2 = sns.countplot(x ='position_group', hue = "gender", data = df, palette = "rocket", stat = 'percent', ax=ax[1] )
for container in p2.containers:
    ax[1].bar_label(container, fontsize=10, fmt= '{:.1f}%')

p1.set(xlabel='', ylabel='')
p2.set(xlabel='', ylabel='')
ax[0].set_title('Count by Position Groups', pad = 7, fontsize=14)
ax[1].set_title('Percentage by Position Groups', pad = 7, fontsize=14)
ax[0].get_legend().set_visible(False)

plt.show()
No description has been provided for this image

Observations:

  • As we can see, University shows near parity between genders, suggesting a more balanced distribution in academic or administrative roles.
  • The Health System is a female-dominated sector, indicating a possible gender trend in professions related to health within this dataset.
  • Athletics remains male-dominated, which aligns with traditional gender roles in sports-related fields.

Analyze numerical data¶

Overview¶

In [16]:
# Summary statistics for numerical columns
df.describe().apply(lambda s: s.apply('{0:.0f}'.format))
Out[16]:
regular_pay bonus other overtime gross_pay
count 41887 41887 41887 41887 41887
mean 68786 2340 2784 1082 74991
std 74811 16091 37500 4257 100274
min -2186 -1000 -2500 -15 0
25% 27608 0 0 0 29155
50% 55040 0 0 0 57726
75% 87366 600 18 192 91459
max 2433750 1385738 6923463 152487 9173463
In [17]:
# dataset with all the numerical columns
numcol = df.select_dtypes(include=["number"])
In [18]:
# detailed percentile data
numcol.quantile(np.arange(0.0, 1.1,0.1)).apply(lambda s: s.apply('{0:.1f}'.format))
Out[18]:
regular_pay bonus other overtime gross_pay
0.00 -2185.7 -1000.0 -2500.0 -15.0 0.0
0.10 5541.6 0.0 0.0 0.0 6179.8
0.20 19823.5 0.0 0.0 0.0 21059.2
0.30 34695.5 0.0 0.0 0.0 36303.6
0.40 44489.9 0.0 0.0 0.0 47003.2
0.50 55040.4 0.0 0.0 0.0 57726.0
0.60 66527.8 600.0 0.0 0.0 69668.4
0.70 79660.0 600.0 0.0 25.8 83249.1
0.80 96683.2 600.0 330.0 509.4 101939.3
0.90 125688.7 2250.0 4476.1 2426.4 136504.0
1.00 2433750.1 1385737.5 6923462.6 152487.4 9173462.6

Observations:

  • The mean is much higher than the median.
  • There is a substantial gap between the 90th-percentile and 100th-percentile values. These are indicators that we are not dealing with normally distributed data.

Visualizations of distribution¶

In [19]:
# Create boxplots
fig = make_subplots(rows=5, cols=1, subplot_titles=("Regular Pay", "Bonus", "Overtime", "Other Compensation", "Gross Pay"))
fig.add_trace(go.Box(x=df.regular_pay), row=1, col=1)
fig.add_trace(go.Box(x=df.bonus), row=2, col=1)
fig.add_trace(go.Box(x=df.overtime), row=3, col=1)
fig.add_trace(go.Box(x=df.other), row=4, col=1)
fig.add_trace(go.Box(x=df.gross_pay), row=5, col=1)

fig.update_layout(height=1000, width=1100, showlegend=False)
fig.update_yaxes(showticklabels=False)
fig.show()
In [20]:
# Create histograms
fig, axs = plt.subplots(5, 1, figsize=(15, 15))
sns.despine(fig)
axs[0].hist(df.regular_pay, bins=60, color='blue', edgecolor='black', alpha=0.7)
axs[0].axvline(x=df.regular_pay.mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
axs[0].axvline(x=df.regular_pay.median(), color='black', linestyle='dashed', linewidth=2, label='Median')
axs[0].legend()
axs[0].set_title('Regular Pay')

axs[1].hist(df.bonus, bins=60, color='skyblue', edgecolor='black')
axs[1].axvline(x=df.bonus.mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
axs[1].axvline(x=df.bonus.median(), color='black', linestyle='dashed', linewidth=2, label='Median')
axs[1].legend()
axs[1].set_title('Bonus')

axs[2].hist(df.overtime, bins=180, color='hotpink', edgecolor='black')
axs[2].axvline(x=df.overtime.mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
axs[2].axvline(x=df.overtime.median(), color='black', linestyle='dashed', linewidth=2, label='Median')
axs[2].legend()
axs[2].set_title('Overtime')

axs[3].hist(df.other, bins=80, color='yellow', edgecolor='black')
axs[3].axvline(x=df.other.mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
axs[3].axvline(x=df.other.median(), color='black', linestyle='dashed', linewidth=2, label='Median')
axs[3].legend()
axs[3].set_title('Other Compensation')

axs[4].hist(df.gross_pay, bins=80, color='indigo', edgecolor='black', alpha=0.7)
axs[4].axvline(x=df.gross_pay.mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
axs[4].axvline(x=df.gross_pay.median(), color='black', linestyle='dashed', linewidth=2, label='Median')
axs[4].legend()
axs[4].set_title('Gross Pay')

# Adjust layout
plt.tight_layout()

# Display the plot
plt.show()
No description has been provided for this image

Observations¶

  • The negative minimum values are normal as the dataset description says "amounts represent paid earnings, including corrections to prior payrolls and in some cases may reflect negative values".
  • The mean is higher than the median, indicating a right-skewed distribution.
  • The standard deviation is quite large, indicating high variability (especially in regular pay).
  • There are some extreme outliers. This needs to be investigated and potentially be removed or capped to prevent them from skewing analyses.
  • Many employees do not receive bonuses, do not earn overtime or receive other compensation since the 25th and 50th percentiles are 0.

Highest Paying Jobs¶

In [21]:
df3 = df[df['job'].map(df['job'].value_counts()) >= 5]
# Get the top 10 job titles with the highest mean salary
top_10 = df3.groupby('job')['gross_pay'].mean().nlargest(10)

# Create a single bar plot for the top 10 highest paying job titles and their mean salaries
plt.figure(figsize=(12, 6))
sns.barplot(x=top_10.index, y=top_10.values, color='skyblue')
sns.despine(top=True, right=True, left=False, bottom=False)
# Customize the plot
plt.xlabel('Job Title')
plt.ylabel('Mean Gross Pay')
plt.title('Top 10 Highest Paying Jobs')
plt.xticks(rotation=25, ha = 'right') 

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Gross Pay: categories distribution¶

Let's see what the distribution of the quality column looks like:

In [22]:
df1 = df.copy()
In [23]:
# Discretize variables into buckets
bins = [0, 25000, 50000, 75000, 100000, 150000, 350000, 500000, 1000000, 10000000]
labels = ['0-25K', '25-50K', '50-75K', '75-100K', '100-150K', '150-350K', '350K-500K', '500K-1M', '1M-10M']
df1['gross_pay_c'] = pd.cut(df1['gross_pay'], bins, labels= labels)
In [24]:
ax = df1.gross_pay_c.value_counts().sort_index().plot.barh(title= 'Distribution of Gross Pay Categories', 
                                                           figsize=(10, 4), color = 'teal')
ax.axes.invert_yaxis()
for bar in ax.patches:
    ax.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f'{bar.get_width()/df.shape[0]:.1%}', verticalalignment='center')
ax.spines[['right', 'top']].set_visible(False) 
plt.xlabel('')
plt.ylabel('')
plt.show()
No description has been provided for this image

Observations:

  • The distribution is skewed towards the lower pay ranges, with over 60% of employees earning under 75K.
  • A relatively small percentage of employees are in the highest pay ranges, indicating that higher salaries (above 150K) are rare in this dataset reflecting a typical salary distribution where a smaller number of employees earn higher wages.
  • The concentration of salaries in the lower pay brackets may suggest a wage disparity across different positions or departments.

Gross Pay by Position Group¶

Mean and Median Gross Pay by Position Group¶

In [25]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6), sharey=True, tight_layout = True)

ax = sns.barplot(ax=axes[0], x='position_group', y='gross_pay', data=df,  estimator=np.mean, errorbar = None, palette = "magma")
for i in ax.containers:
    ax.bar_label(i, fontsize=12, padding=5, fmt='$%0.f')
axes[0].set_title('MEAN Gross Pay by Position Group', pad = 10, fontweight ='bold')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

ax2 = sns.barplot(ax=axes[1], x='position_group', y='gross_pay', data=df,  estimator=np.median, errorbar = None, palette = "magma")
for i in ax2.containers:
    ax2.bar_label(i, fontsize=12, padding=5, fmt='$%0.f')
axes[1].set_title('MEDIAN Gross Pay by Position Group', pad = 10, fontweight ='bold')
ax2.set(xlabel='', ylabel='')
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)
No description has been provided for this image

Observations

  • The University group has a very high pay disparity (larger gap between mean and median), which could indicate income inequality within this group.
  • The Health System group shows the most consistent salary distribution, with the mean and median closely aligned.
  • The Athletics group displays an extreme skew, where most employees earn very low wages, but a few individuals with much higher salaries inflate the mean considerably.

The Athletics group's median being so low suggests that it has a large number of part-time, temporary, or lower-wage roles with only a few individuals earning significantly more, which leads to the skew.

Mean and Median Gross Pay By Gender and Position Group¶

In [26]:
fig, axes = plt.subplots(2,1, figsize=(12, 8), sharey=True)
fig.tight_layout(pad=5)
ax = sns.barplot(ax=axes[0], x='position_group', y='gross_pay', hue='gender', data=df, 
                 estimator=np.mean, errorbar = None, palette = "Purples")
for i in ax.containers:
    ax.bar_label(i, fontsize=10, padding=5, fmt='$%0.f')
axes[0].set_title('MEAN Gross Pay by Gender and Position Group', pad = 10, fontweight ='bold')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

ax2 = sns.barplot(ax=axes[1], x='position_group', y='gross_pay', hue='gender', data=df, 
                  estimator=np.median, errorbar = None, palette = "Purples")
for i in ax2.containers:
    ax2.bar_label(i, fontsize=10, padding=5, fmt='$%0.f')
axes[1].set_title('MEDIAN Gross Pay by Gender and Position Group', pad = 10, fontweight ='bold')
ax2.set(xlabel='', ylabel='')
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)
No description has been provided for this image

Observations:

Gender Pay Gap: In all three groups (University, Health System, and Athletics), males consistently earn more than females (excluding the median in Health System), although the size of the gap varies significantly.

  • University: The gender pay gap is present in both mean and median pay, with males earning more.
  • Health System positions have the smallest gender gap, and the median indicates females earn slightly more.
  • Athletics: There is a significant difference in mean pay between genders, but the median pay is relatively similar, indicating that outliers may be skewing the mean.

These observations point to a persistent gender pay gap across different position groups, with certain sectors (like Athletics) showing a much more severe disparity than others.

Gross Pay and Gender¶

Mean Gross Pay By Gender¶

In [27]:
average_salary = df.groupby(["gender"])["gross_pay"].mean().reset_index().sort_values(by='gross_pay', ascending=False)
average_salary
Out[27]:
gender gross_pay
1 Male 82573.18
0 Female 69683.90

A very pronounced gender pay disparity can be observed here.

Gross Pay categories by gender¶

In [28]:
# Create a contingency table
cross_gross = pd.crosstab(df1.gender, df1.gross_pay_c, margins=True)
cross_gross
Out[28]:
gross_pay_c 0-25K 25-50K 50-75K 75-100K 100-150K 150-350K 350K-500K 500K-1M 1M-10M All
gender
Female 4916 5410 5458 4090 3274 1265 152 71 3 24639
Male 4371 3247 3465 2201 2066 1356 308 207 27 17248
All 9287 8657 8923 6291 5340 2621 460 278 30 41887
In [29]:
fig = plt.figure(figsize = (12,4), tight_layout = True)
ax = sns.countplot(x ='gross_pay_c', hue = "gender", data = df1, palette = "winter")
for container in ax.containers:
    ax.bar_label(container, fontsize=10, padding=3)
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay & Gender', pad = 10, fontweight ='bold')
plt.show()
No description has been provided for this image

Observations:

The chart shows that while females are more prevalent in low-to-mid salary ranges, males increasingly dominate higher gross pay brackets, especially in the top earning categories, indicating a gender disparity in higher-paying roles.

I conducted a more in-depth analysis of the gender pay gap in a separate project. You can find it here https://github.com/AlexHarrods/Ohio-State-University-Gender-Pay-Gap-2023-

Outliers¶

Boxplot visualisation¶

In [72]:
# boxplot by gender
fig = plt.figure(figsize = (12,4), tight_layout = True)
ax = sns.boxplot(x=df1.gross_pay, palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image
In [73]:
# Let's remove some extreme outliers to see better
fig = plt.figure(figsize = (12,4), tight_layout = True)
ax = sns.boxplot(data=df.loc[df.gross_pay<1500000], x = 'gross_pay', palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot without Most Extreme Outliers', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image

Observations:

  1. Significant Number of Outliers: There is a heavy presence of outliers on the higher end of the salary range. The gross pay extends as high as 9 million, while the majority of the data is clustered near the lower end, which means that the dataset contains some extremely high earners.
  2. Extreme Outliers: A few extreme values stand out, even among the outliers. There are points that are far removed from the cluster of outliers between 200,000 to 1 million gross pay, especially those in the 4 million to 8 million range.
  3. Tight Main Data Range: The bulk of the data is very tightly packed near the lower end of the scale. This indicates that a large percentage of employees have relatively low gross pay. This contrasts sharply with the few very high salaries, which are far removed from the rest of the data.

Boxplots by Position Group¶

In [88]:
fig = plt.figure(figsize = (12,5), tight_layout = True)
ax = sns.boxplot(data=df, x = 'gross_pay', y = 'position_group', palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot by Position Group', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image
In [94]:
# Let's remove some extreme outliers to see better
fig = plt.figure(figsize = (12,6), tight_layout = True)
ax = sns.boxplot(data=df.loc[df.gross_pay<3000000], x = 'gross_pay', y = 'position_group', palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot by Position Group without Extreme Outliers', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image

Observations

All groups show outliers, though the distribution in terms of these high-earners varies significantly between them.

  • Athletics has the largest outliers by a wide margin, with the highest-paid individual earning over $9M. This group demonstrates the most significant pay disparity.
  • University has a substantial number of outliers, indicating a more diverse salary range within this position group. Several high-paid individuals earn over $1M.
  • Health System shows a few high-paid individuals as well, but the range of outliers is more contained compared to the other two groups.

Boxplots by Position Group and Gender¶

In [92]:
fig = plt.figure(figsize = (12,5), tight_layout = True)
ax = sns.boxplot(data=df, x="gross_pay", y="position_group", hue = "gender", palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot by Position Group and Gender', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image
In [93]:
fig = plt.figure(figsize = (12,5), tight_layout = True)
ax = sns.boxplot(data=df.loc[df.gross_pay<3000000], x="gross_pay", y="position_group", hue = "gender", palette = 'magma')
ax.set(xlabel='', ylabel='')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title('Gross Pay Boxplot by Position Group and Gender without Extreme Outliers', pad = 15, fontweight ='bold')
plt.show()
No description has been provided for this image

Observations:

  • Males consistently occupy more of the higher-paying positions across all position groups, with several extreme outliers well above $1M. This trend is especially prominent in Athletics, where male outliers far surpass their female counterparts in gross pay.
  • The most extreme male outliers are found in the Athletics group, with earnings exceeding $8M. This suggests high compensation concentration among a select few males in this group, likely driven by prominent roles (e.g., coaching).
  • Female outliers are consistently present but earn significantly less, with few crossing the $1M mark. This reflects a consistent underrepresentation of females in top-earning positions across all groups.
  • The disparity in extreme outliers between genders is most pronounced in Athletics, where male outliers vastly outnumber and out-earn female outliers.

This visualization suggests that addressing gender pay disparities, especially in leadership and high-earning positions, remains an area of concern.

Skewness and kurtosis¶

Skewness and kurtosis describe how symmetrical the distribution is and how fat the tails of the distribution are, respectivelylly

In [32]:
numcol.skew()
Out[32]:
regular_pay     5.45
bonus          28.62
other         155.76
overtime        8.95
gross_pay      22.90
dtype: float64
In [33]:
numcol.kurtosis()
Out[33]:
regular_pay      76.05
bonus          1610.58
other         28007.99
overtime        129.32
gross_pay      1697.63
dtype: float64

Both measures are significantly higher than we would expect if our variables were distributed normally.

Correlation¶

If a variable of interest is significantly correlated with another variable, we should take that relationship into account when trying to identify outliers. Let's see which ones are more corellated.

In [34]:
pd.options.display.float_format = '{:.2f}'.format
corr_mat = numcol.corr()
corr_mat
Out[34]:
regular_pay bonus other overtime gross_pay
regular_pay 1.00 0.55 0.23 0.05 0.92
bonus 0.55 1.00 0.11 -0.02 0.61
other 0.23 0.11 1.00 -0.02 0.56
overtime 0.05 -0.02 -0.02 1.00 0.07
gross_pay 0.92 0.61 0.56 0.07 1.00
In [35]:
# Select correlations greater than 0.5
high_corr = corr_mat[abs(corr_mat) >= 0.5]

# Plot correlation heatmap
plt.figure(figsize=(10,5))
sns.heatmap(high_corr,
            annot=True,
            fmt='.1f',
            cmap='GnBu',
            vmin=0.5,
            vmax=1)
title = plt.title('Correlation Heatmap')
No description has been provided for this image

Regular Pay and Gross Pay are highly correlated. We also see smaller (0.6) but still substantial correlation between Regular Pay and Bonus, Gross Pay and Bonus, Gross Pay and Other.

Bivariate Outliers: Gross Pay and Regular Pay¶

Let's investigate the relationship and see if there's anything unusual.

In [36]:
# do a scatterplot of regular_pay by gross_pay
ax = sns.regplot(x="gross_pay", y="regular_pay", data=df1, line_kws={"color": "red"})
ax.set(title="Regular Pay and Gross Pay")
plt.show()
No description has been provided for this image
In [37]:
# see if they have unexpected low or high values
df1['regular_pay_qq'] = pd.qcut(df1['regular_pay'], labels=['very low','low','medium', 'high','very high'], q=5, precision=0)
df1['gross_pay_qq'] = pd.qcut(df1['gross_pay'], labels=['very low','low','medium', 'high','very high'], q=5, precision=0)
In [38]:
cross = pd.crosstab(df1.regular_pay_qq, df1.gross_pay_qq)
cross
Out[38]:
gross_pay_qq very low low medium high very high
regular_pay_qq
very low 8175 185 9 2 7
low 203 7674 458 36 6
medium 0 518 7340 496 23
high 0 0 570 7341 466
very high 0 0 0 502 7876
In [39]:
df1.loc[(df1.regular_pay_qq=="very low") & (df1.gross_pay_qq=="very high") | (df1.regular_pay_qq=="very low") & (df1.gross_pay_qq=="high")]. T
Out[39]:
5064 13930 18137 19303 22224 37356 37376 37564 39297
full_name Thomas Ryan Meghan Cook Mary Cairns Kevin Wilson Michael Schmid David Yablok Robert Bornstein John Antenucci Heather Thomas
gender Male Female Female Male Male Male Male Male Female
position_group Health System Health System Health System Athletics Health System Health System University Health System Health System
job Physician Intermittent Physician Intermittent Physician Assistant Coach-Major Sports Intermittent Physician Intermittent Physician Retiree-Faculty Emeritus Intermittent Physician Intermittent Physician
hierarchy Health Sciences CCH6 Health Sciences CCH6 Health Sciences CCH6 Athletics and Business Advancement CCH6 Health Sciences CCH6 Health Sciences CCH6 Medicine CCH6 Health Sciences CCH6 Health Sciences CCH6
cost_center CC11301 Health Sciences | FGP IM Cardiovascula... CC14267 Health Sciences | Anesthesiology Services CC12915 Health Sciences | FGP Psychiatry CC12637 Athletics | Football CC12868 Health Sciences | FGP Anesthesiology CC12868 Health Sciences | FGP Anesthesiology CC12838 Medicine | Psychiatry CC12868 Health Sciences | FGP Anesthesiology CC12868 Health Sciences | FGP Anesthesiology
regular_pay 19817.66 0.00 0.00 5303.09 0.00 0.00 1200.00 0.00 0.00
bonus 0.00 5697.33 0.00 297500.00 0.00 0.00 0.00 0.00 0.00
other 109761.60 109742.33 103872.00 161687.67 94431.00 300775.00 86900.00 310105.00 303353.00
overtime 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
gross_pay 129579.26 115439.66 103872.00 464490.76 94431.00 300775.00 88100.00 310105.00 303353.00
gross_pay_c 100-150K 100-150K 100-150K 350K-500K 75-100K 150-350K 75-100K 150-350K 150-350K
regular_pay_qq very low very low very low very low very low very low very low very low very low
gross_pay_qq very high very high very high very high high very high high very high very high

There are people with very low Regular Pay but very high Gross Pay.

In [40]:
df1.loc[(df1.job=="Intermittent Physician") & (df1.regular_pay > 0)]
Out[40]:
full_name gender position_group job hierarchy cost_center regular_pay bonus other overtime gross_pay gross_pay_c regular_pay_qq gross_pay_qq
1282 Jason Bryant Male University Intermittent Physician Medicine CCH6 CC12848 Medicine | Anesthesiology 811.37 0.00 6383.06 0.00 7194.43 0-25K very low very low
5999 Jessica Short Female Health System Intermittent Physician Health Sciences CCH6 CC12915 Health Sciences | FGP Psychiatry 184767.30 23206.00 49291.60 0.00 257264.90 150-350K very high very high
21188 Anita Chang Female Health System Intermittent Physician Health Sciences CCH6 CC12915 Health Sciences | FGP Psychiatry 132545.25 4702.03 22083.43 0.00 159330.71 150-350K very high very high
21984 Geoffrey Thomas Male University Intermittent Physician Medicine CCH6 CC12848 Medicine | Anesthesiology 33612.00 0.00 14025.00 0.00 47637.00 25-50K low medium

Observations¶

  • 6 of the 9 individuals are Intermittent Physicians, suggesting they might not be full-time employees, which could explain the low regular pay and high "other" compensation.
  • Kevin Wilson, an Assistant Coach in Major Sports, has a different structure with a notable bonus and other compensations contributing to the gross pay.
  • Robert Bornstein, listed as "Retiree-Faculty Emeritus," has a small regular_pay but his gross pay is primarily from other categories. This might be linked to retirement or emeritus status benefits.
  • The table includes both male and female employees. However, the male employees seem to dominate the higher gross_pay values.

Conclusion:
Certain roles, especially those in the Health System as Intermittent Physicians, receive most of their income through compensation mechanisms other than regular salary. The stark difference between Regular Pay and Gross Pay highlights a unique compensation structure within these roles.

Using Isolation Forest to find outliers¶

Isolation Forest offers some advantages over classical techniques like Z-score or Interquartile Range (IQR) for datasets with skewed distributions, heavy tails, and lots of outliers (the algorithm does not make any assumptions about the underlying distribution of the data and works effectively regardless of skewness and heavy tails).

In [41]:
# create a standardized analysis data frame
standardizer = StandardScaler()
if_analysis = df1[['full_name', 'gender', 'position_group', 'regular_pay', 'bonus','other','gross_pay', 'overtime']]
if_analysisstand = standardizer.fit_transform(if_analysis.iloc[:, 3:])
In [42]:
# run an isolation forest model to detect outliers
clf=IsolationForest(n_estimators=100, max_samples='auto', max_features=1.0, contamination = 0.1)
clf.fit(if_analysisstand)
if_analysis['anomaly'] = clf.predict(if_analysisstand)
if_analysis['scores'] = clf.decision_function(if_analysisstand)
if_analysis.anomaly.value_counts()
Out[42]:
anomaly
 1    37699
-1     4188
Name: count, dtype: int64

4189 individuals (or 10%) are identified as outliers (they have anomaly values of -1.) - we set the contamination to 0.1

In [43]:
# view the outliers
inlier, outlier = if_analysis.loc[if_analysis.anomaly==1], if_analysis.loc[if_analysis.anomaly==-1]
outlier[['full_name', 'gender', 'position_group','regular_pay','bonus','other', 'gross_pay','scores']].sort_values(['scores']).head(5)
Out[43]:
full_name gender position_group regular_pay bonus other gross_pay scores
3242 Eugene Smith Male Athletics 1676222.04 575000.00 537000.00 2788222.04 -0.35
19302 Ryan Day Male Athletics 2000000.04 250000.00 6923462.58 9173462.62 -0.35
15356 Kevin Mcguff Male Athletics 675000.00 110000.00 391533.48 1176533.48 -0.34
7794 Mina Makary Female University 422064.40 266412.14 122715.56 811192.10 -0.33
10470 Arpit Nagar Male University 420307.98 198604.16 110305.88 729218.02 -0.32
In [44]:
# plot the inliers and outliers
fig = plt.figure(figsize=(7, 7))
ax = plt.axes(projection='3d')
ax.set_title('Isolation Forest Anomaly Detection')
ax.set_zlabel("Bonus")
ax.set_xlabel("Gross Pay")
ax.set_ylabel("Regular Pay")
ax.scatter3D(inlier.gross_pay, inlier.regular_pay, inlier.bonus, label="inliers", c="blue")
ax.scatter3D(outlier.gross_pay, outlier.regular_pay, outlier.bonus, label="outliers", c="red")
ax.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Conclusion on Outliers¶

The dataset has a considerable number of outliers on the higher end of the salary distribution. These outliers are likely skewing the analysis, inflating the mean Gross Pay compared to the median. They could be impacting the effectiveness of predictive models, particularly those sensitive to extreme values. Further analysis might involve removing or capping outliers to prevent them from distorting results or applying robust models (e.g., RANSAC or Huber Regression) that can handle such outliers effectively.

Models¶

Let's see how different algorithms deal with outliers

In [45]:
# Separating the dataset
X = df[['regular_pay', 'bonus', 'other']]
y = df.gross_pay

# Splitting data into 25% training and 75% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
In [46]:
X.shape, y.shape # original data
Out[46]:
((41887, 3), (41887,))
In [47]:
X_train.shape, y_train.shape # training data
Out[47]:
((31415, 3), (31415,))
In [48]:
X_test.shape, y_test.shape # testing data
Out[48]:
((10472, 3), (10472,))
In [49]:
# Define the models
models = [
    ('Random Forest', RandomForestRegressor(n_estimators=20)),
    ('Decision Tree', DecisionTreeRegressor(max_depth=10, min_samples_split=2, random_state=0)),
    ('Linear Regression', LinearRegression()),
    ('RANSAC Boosting', RANSACRegressor(random_state=42)),
    ('Huber Regression', HuberRegressor())
]
best_model = None
best_score = -np.inf
best_RMSE = np.inf

# Model training, evaluation, and selection
print("Model performance:")
for name, model in models:
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    score = model.score(X_test, y_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    
    print(f"{name} - score: {score:.3f}, MSE : {mse:.0f}, MAE : {mae:.0f}, RMSE : {rmse:.0f}")
    
    if score > best_score and rmse < best_RMSE:
        best_score = score
        best_model = name
        best_RMSE = rmse
        
print(f"Best model: {best_model} with score: {best_score:.3f} and RMSE: {best_RMSE:.0f}")
Model performance:
Random Forest - score: 0.994, MSE : 47141304, MAE : 2141, RMSE : 6866
Decision Tree - score: 0.970, MSE : 241246573, MAE : 3215, RMSE : 15532
Linear Regression - score: 0.998, MSE : 17605709, MAE : 1734, RMSE : 4196
RANSAC Boosting - score: 0.998, MSE : 17593195, MAE : 1611, RMSE : 4194
Huber Regression - score: 0.998, MSE : 18961768, MAE : 1113, RMSE : 4355
Best model: RANSAC Boosting with score: 0.998 and RMSE: 4194

Observations:

  1. RANSAC Boosting and Linear Regression perform best: Both RANSAC Boosting and Linear Regression yield the highest scores (0.998) with relatively low RMSE values (4194 for RANSAC, 4196 for Linear Regression). This indicates these models handle outliers effectively, maintaining accuracy despite data irregularities.

  2. Huber Regression is robust to outliers: Huber Regression, specifically designed for datasets with outliers, also performs well, with a score of 0.998 and the lowest MAE (1113). Its RMSE (4355) is slightly higher than RANSAC Boosting, but its low MAE suggests it handles extreme values by minimizing their influence more efficiently.

  3. Decision Tree struggles with outliers: The Decision Tree model has the lowest score (0.970) and the highest error metrics (MSE: 241,246,573; RMSE: 15,532). This suggests that Decision Trees are more sensitive to outliers, which can lead to overfitting or skewed predictions.

  4. Random Forest performs reasonably well: Random Forest achieves a strong score (0.992), but its RMSE (8121) is significantly higher than the best-performing models. While it handles outliers better than Decision Trees, it still struggles compared to more robust models like RANSAC and Huber Regression.

  5. Outlier impact on performance: The models designed or adapted to handle outliers (RANSAC Boosting and Huber Regression) consistently outperform others, particularly in terms of error reduction. This highlights the importance of using robust techniques when dealing with skewed data or significant outliers.

In summary, RANSAC Boosting emerges as the best performer, with Linear Regression and Huber Regression also showing strong resilience to outliers. Decision Trees, however, are less robust in this context.

Final Analysis¶

The Ohio State University (OSU) 2023 Earnings dataset reveals significant insights into earnings distribution, gender representation, and position-based disparities:

  1. Gender Representation: Females (58.8%) outnumber males (41.2%), reflecting higher female employment, particularly in Health System position group. Despite this, a gender pay gap persists, with males dominating higher salary brackets, especially in Athletics.
  2. Position Groups: Most employees work in the University group (52.8%), while Athletics employs significantly fewer people (5.1%). The smaller Athletics group contains highly paid outliers, contributing to salary skewness in this sector.
  3. Gender and Position Groups: The University shows near gender parity, but Athletics remains male-dominated, aligning with traditional gender roles. The Health System is female-dominated, reflecting broader gender trends in healthcare professions.
  4. Negative Values: The dataset includes numerous negative values, especially in certain pay categories (up to 70% in some), unlike the Ohio University dataset, which contains no negative values. This may reflect differing accounting processes.
  5. Gender Pay Gap: despite a higher female representation overall (58.8%), a persistent gender pay disparity exists across various position groups. Males consistently earn more than females, particularly in higher salary brackets and in sectors like Athletics, where outliers further inflate male earnings. The gap is smaller in the Health System, where females even slightly outperform males at the median level.
  6. Earnings Distribution and Outliers: The dataset is highly right-skewed, with extreme outliers driving up the mean gross pay. Most employees earn under $75K, while a few high earners skew the overall salary distribution. These outliers inflate average earnings and complicate analysis.
  7. Model Performance: Robust models like RANSAC Boosting and Huber Regression handle outliers effectively, achieving high accuracy and minimizing errors. Models like Decision Trees struggle with outliers, indicating the need for specialized techniques in handling skewed data.

Conclusion: The OSU dataset highlights challenges like gender pay gaps, skewed salary distributions, and the presence of extreme outliers. Addressing these issues requires robust modeling techniques and further investigation into compensation structures, particularly in Athletics and Healthcare.